A4 Llama 3.1 70B recipe on NeMo 2.0 with GCSFuse storage #37

akansha1812 · 2025-11-03T23:30:41Z

Add complete helm chart with readme and tests the scripts.

TODO: update src/helm-charts/storage/gcs-fuse/templates/pv.yaml with comment when to add machine-type:a3-highgpu-8g based on b/450059657#comment27

mkmg · 2025-11-13T21:34:43Z

training/a4/llama3-1-70b/nemo2-pretraining-gke/README.md

+```
+cd $REPO_ROOT/src/utils/checkpointing_metrics
+python3 calculate_checkpoint_metrics.py --gcs_logs_path=${GCS_LOGS_PATH}
+```


Did you test this? I'm not sure if it has been updated to work with Nemo 2.

removed this.

mkmg · 2025-11-13T21:35:17Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/README.md

Let's update the file path to match the other recipes in this directory with a "-gcs" suffix.

mkmg · 2025-11-13T21:35:44Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/README.md

@@ -0,0 +1,303 @@
+<!-- mdformat global-off -->
+# Pretrain llama3-1-70b-gpus128 workloads on a4 GKE Node pools with Nvidia NeMo Framework  using Google Cloud Storage for training data and checkpoints


mkmg · 2025-11-13T21:36:06Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/README.md

@@ -0,0 +1,303 @@
+<!-- mdformat global-off -->
+# Pretrain llama3-1-70b-gpus128 workloads on a4 GKE Node pools with Nvidia NeMo Framework  using Google Cloud Storage for training data and checkpoints


Is there supposed to be 2 spaces here?

mkmg · 2025-11-13T21:36:48Z

training/a4/llama3-1-70b/nemo2-pretraining-gke/README.md

+
+### Configure and submit a pretraining job
+
+#### Using 16 node (64 gpus) fp8 precision


Is if fp8 or bf16?

bf16. updated

mkmg · 2025-11-13T21:53:08Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/README.md

+### Analyze results
+
+When completed, the job creates several artifacts, including logs and traces, and places them
+in the  Google Cloud Storage logs bucket as follows:
+
+```
+gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/<JOB_ID>
+├── nemo-configuration.yaml
+├── lightning_logs.txt
+├── nemo_error_logs.txt
+├── nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt
+├── dllogger
+│   ├── rank-0
+│   │   ├── dllogger.json
+...
+```
+
+- `nemo-configuration.yaml`: the NeMo configuration used by the pretraining script. This includes
+   the combined [configuration file](../16node-bf16-seq8192-gbs512/llama3-1-70b.py)
+   and the command line overrides
+- `lightning_logs.txt`: the log files generated by PyTorch Lightning, which is used by NeMo
+- `nemo_error_logs.txt`: the warning and error logs generated by NeMo
+- `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt`: the NeMo logs for each rank
+- `dllogger/`: The log captured by [NVIDIA DLLogger](https://github.com/NVIDIA/dllogger):
+   DLLogger is configured to store logs on the rank 0 node. The log is in JSON format
+   and includes loss, step_time, and other key metrics for each training step
+
+The `<JOB_ID>` has the following format:
+- `$USER--llama31-70b-gcs-[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]`, where the suffix of the ID is a day and time when the job was started.
+
+
+The NeMo log files include information about checkpoint operations on each rank. You can use the [checkpointing_metrics](../../../../src/utils/checkpointint_metrics) utility to calculate statistics for checkpoint write times.
+
+To calculate statistics:
+
+
+1. Set a path to the NeMo logs.
+
+```
+export JOB_ID=<JOB_ID>
+export GCS_LOGS_PATH="gs://${GCS_BUCKET_LOGS}/nemo-experiments-storage/${JOB_ID}"
+```
+
+Replace `<JOB_ID>` with the ID of your job.


This section seems to end abruptly. I'm ok if we don't want to update the checkpoint metrics utility, but we should at least tell users where they can find this data in the logs.

akansha1812 · 2025-11-17T19:29:31Z

src/helm-charts/storage/gcs-fuse/templates/pv.yaml

-    - file-cache:enable-parallel-downloads:true
-    - file-system:kernel-list-cache-ttl-secs:0
    - write:enable-streaming-writes:true
+    - machine-type:a3-highgpu-8g


TODO: Add a comment on which version the fix has been applied and we need to version for which this workaround is required

"These gcsfuse versions" -> "Earlier GCSFuse versions"

mkmg · 2025-11-17T20:26:20Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/README.md

-```

-Replace `<JOB_ID>` with the ID of your job.
+The NeMo log files include information about checkpoint operations on each rank. Users can find checkpoint read and write informatiom in `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt` files.


"information"

Can we just tell them to check rank 0?

mkmg · 2025-11-19T17:19:31Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/values.yaml

+  envs:
+  - name: GLOO_SOCKET_IFNAME
+    value: eth0
+  gcsSidecarImage: gcr.io/gcs-tess/ashmeen/gcs-fuse-csi-driver-sidecar-mounter:v3.2.0_test


We shouldn't specify the sidecar image since this image is not publicly available.

mkmg · 2025-11-19T17:20:44Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/README.md

+  - Kueue and JobSet APIs installed
+  - Kueue configured to support Topology Aware Scheduling
+- A regional Google Cloud Storage (GCS) bucket to store logs.
+- A regional Google Cloud Storage (GCS) bucket with [hierarchical](https://cloud.google.com/storage/docs/hns-overview)) namespace to store the Pile dataset


There is an extra ")" on this line.

mkmg · 2025-11-19T17:22:26Z

training/a4/llama3-1-70b/nemo-pretraining-gke/16node-bf16-seq8192-gbs512-gcs/README.md

+   - Helm
+   - kubectl
+
+*Important: All GCS buckets must be in the same region as the GKE cluster*.


Use two ** to make this bold?

akansha1812 added 6 commits November 3, 2025 23:27

A4 Llama 3.1 70B recipe on NeMo 2.0 with GCSFuse storage

2614505

update project-id

bb9977f

remove default values

ada2196

files will be set as part of helm command

00c8b7a

test all the commands and fix

668a204

fix typo

4a85a19

mkmg reviewed Nov 13, 2025

View reviewed changes

akansha1812 added 4 commits November 13, 2025 21:42

resolve comments

0fd8ed4

gcs suffix

0e7c510

resolve comments

327947d

add gcs suffix

0d94e7b

mkmg reviewed Nov 13, 2025

View reviewed changes

akansha1812 added 3 commits November 14, 2025 07:41

change dataset path

555c87d

update readme

658c35b

update readme

e8bda69

akansha1812 requested a review from mkmg November 17, 2025 19:16

akansha1812 commented Nov 17, 2025

View reviewed changes

mkmg reviewed Nov 17, 2025

View reviewed changes

update readme

e88b14c

akansha1812 requested a review from mkmg November 19, 2025 01:04

machine-type comment

2cf2379

mkmg reviewed Nov 19, 2025

View reviewed changes

		@@ -0,0 +1,303 @@
		<!-- mdformat global-off -->
		# Pretrain llama3-1-70b-gpus128 workloads on a4 GKE Node pools with Nvidia NeMo Framework using Google Cloud Storage for training data and checkpoints


		### Configure and submit a pretraining job

		#### Using 16 node (64 gpus) fp8 precision

A4 Llama 3.1 70B recipe on NeMo 2.0 with GCSFuse storage #37

Are you sure you want to change the base?

A4 Llama 3.1 70B recipe on NeMo 2.0 with GCSFuse storage #37

Uh oh!

Conversation

akansha1812 commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

akansha1812 commented Nov 3, 2025 •

edited

Loading